NZSA/ORSNZ 2016

Outline

  • Session 1: How to think about data, and getting started plotting with the grammar of graphics
  • Session 2: Wrangling data into shape, useful practices, cognitive perception
  • Session 3: Advanced graphics, layering, using maps

Using the Packages tidyr, dplyr

  • Writing readable code using pipes
  • What is tidy data? Why do you want tidy data? Getting your data into tidy form using tidyr.
  • Summarise, mutate, filter, select, arrange with dplyr
  • Reading different data formats
  • String operations, working with text
  • Re-structuring time variables
  • Computing on lists with purrr (not covered)
  • Handling missing values (not covered)

Pipes %>%

Pipes allow the code to be read like a sequence of operations

student <- readRDS("../data/student_sub.rds")
student %>% group_by(ST04Q01) %>% 
  summarise(m=mean(PV1MATH), s=sd(PV1MATH))
#> # A tibble: 2 × 3
#>   ST04Q01     m     s
#>     <dbl> <dbl> <dbl>
#> 1       1   484   100
#> 2       2   495   105

Example Data 1

What are the variables?

Inst AvNumPubs AvNumCits PctCompletion
ARIZONA STATE UNIVERSITY 0.90 1.57 32
AUBURN UNIVERSITY 0.79 0.64 44
BOSTON COLLEGE 0.51 1.03 47
BOSTON UNIVERSITY 0.49 2.66 34
BRANDEIS UNIVERSITY 0.30 3.03 49
BROWN UNIVERSITY 0.84 2.31 55

Example Data 2

What's in the column names of this data? What are the experimental units? What are the measured variables?

id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2 WI-12.R4 WM-12.R1 WM-12.R2 WM-12.R4
Gene 1 2.2 2.20 4.2 2.63 5.1 4.5 5.5 4.4 3.9 4.2 3.73
Gene 2 1.5 0.59 1.9 0.52 2.9 1.4 3.0 1.3 1.2 1.2 0.89
Gene 3 2.0 0.87 3.3 0.53 4.6 2.2 5.6 2.5 2.5 3.0 1.35

Example Data 3

What are the variables? What are the records?

melbtemp <- read.fwf("../data/ASN00086282.dly", 
   c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)), fill=T)
kable(head(melbtemp[,c(1,2,3,4,seq(5,128,4))]))
V1 V2 V3 V4 V5 V9 V13 V17 V21 V25 V29 V33 V37 V41 V45 V49 V53 V57 V61 V65 V69 V73 V77 V81 V85 V89 V93 V97 V101 V105 V109 V113 V117 V121 V125
ASN00086282 1970 7 TMAX 141 124 113 123 148 149 139 153 123 108 119 112 126 112 115 133 134 126 104 143 141 134 117 142 158 149 133 143 150 145 115
ASN00086282 1970 7 TMIN 80 63 36 57 69 47 84 78 49 42 48 56 51 36 44 39 40 58 15 33 51 74 39 66 78 36 61 46 42 63 39
ASN00086282 1970 7 PRCP 3 30 0 0 36 3 0 0 10 23 3 0 5 0 0 0 0 0 8 0 18 0 0 0 0 13 3 0 25 0 3
ASN00086282 1970 8 TMAX 145 128 150 122 109 112 116 142 166 127 117 127 159 143 114 65 113 125 129 147 161 168 178 161 145 142 137 150 120 114 129
ASN00086282 1970 8 TMIN 50 61 75 67 41 51 48 -7 56 62 47 33 67 84 11 41 18 50 22 28 74 94 73 88 50 48 54 78 47 18 39
ASN00086282 1970 8 PRCP 0 66 0 53 13 3 8 0 0 0 3 5 0 0 64 3 99 36 8 0 0 0 8 36 25 30 56 5 69 3 20

Example Data 4

What are the variables? What are the experimental units?

tb <- read_csv("../data/tb.csv")
#tail(tb)
colnames(tb)
#>  [1] "iso2"   "year"   "m_04"   "m_514"  "m_014"  "m_1524" "m_2534"
#>  [8] "m_3544" "m_4554" "m_5564" "m_65"   "m_u"    "f_04"   "f_514" 
#> [15] "f_014"  "f_1524" "f_2534" "f_3544" "f_4554" "f_5564" "f_65"  
#> [22] "f_u"

Example Data 5

What are the variables? What are the experimental units?

pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F
)
kable(pew[1:5, 1:5])
religion <$10k $10-20k $20-30k $30-40k
Agnostic 27 34 60 81
Atheist 12 27 37 52
Buddhist 27 21 30 34
Catholic 418 617 732 670
Don’t know/refused 15 14 15 11

Example Data 6

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0

What is the experimental unit? What are the factors of the experiment? What was measured? What do you want to know?

Messy Data Patterns

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.

  • Column headers are values, not variable names
  • Variables are stored in both rows and columns, contingency table format
  • One type of experimental unit stored in multiple tables
  • Dates in many different formats

What is Tidy Data?

  • Each observation forms a row
  • Each variable forms a column
  • Contained in a single table
  • Long form makes it easier to reshape in many different ways
  • Wide form is common for analysis

Tidy Data

Messy Data

Tidy Verbs

  • gather: specify the keys (identifiers) and the values (measures) to make long form (used to be called melting)
  • spread: variables in columns (used to be called casting)
  • nest/unnest: working with lists
  • separate/unite: split and combine columns

French Fries

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0
1 1 15 1 1.2 0.1 0.0 1.1 5.1
1 1 15 2 8.8 3.0 3.6 1.5 2.3

What Would We Like to Know?

  • Is the design complete?
  • Are replicates like each other?
  • How do the ratings on the different scales differ?
  • Are raters giving different scores on average?
  • Do ratings change over the weeks?

Each of these questions involves different summaries of the data.

Gathering

  • When gathering, you need to specify the keys (identifiers) and the values (measures).

Keys/Identifiers: - Identify a record (must be unique) - Example: Indices on an random variable - Fixed by design of experiment (known in advance) - May be single or composite (may have one or more variables)

Values/Measures: - Collected during the experiment (not known in advance) - Usually numeric quantities

Gathering the French Fries

ff_long <- gather(french_fries, key = variable, value = 
                    rating, potato:painty)
head(ff_long)
#>   time treatment subject rep variable rating
#> 1    1         1       3   1   potato    2.9
#> 2    1         1       3   2   potato   14.0
#> 3    1         1      10   1   potato   11.0
#> 4    1         1      10   2   potato    9.9
#> 5    1         1      15   1   potato    1.2
#> 6    1         1      15   2   potato    8.8

Long to Wide

In certain applications, we may wish to take a long dataset and convert it to a wide dataset (perhaps displaying in a table).

This is called "spreading" the data.

Spread

We use the spread function from tidyr to do this:

french_fries_wide <- spread(ff_long, key = variable, 
                            value = rating)

head(french_fries_wide)
#>   time treatment subject rep buttery grassy painty potato rancid
#> 1    1         1       3   1     0.0    0.0    5.5    2.9    0.0
#> 2    1         1       3   2     0.0    0.0    0.0   14.0    1.1
#> 3    1         1      10   1     6.4    0.0    0.0   11.0    0.0
#> 4    1         1      10   2     5.9    2.9    0.0    9.9    2.2
#> 5    1         1      15   1     0.1    0.0    5.1    1.2    1.1
#> 6    1         1      15   2     3.0    3.6    2.3    8.8    1.5

Answer some Questions

  • Easiest question to start is whether the ratings are similar on the different scales, potato'y, buttery, grassy, rancid and painty.

  • We need to gather the data into long form, and make plots facetted by the scale.

Ratings on the Different Scales

ff.m <- french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep)
head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8

ggplot(data=ff.m, aes(x=rating)) + geom_histogram(binwidth=2) + 
  facet_wrap(~type, ncol=5) 

Side-by-side boxplots

ggplot(data=ff.m, aes(x=type, y=rating, fill=type)) + 
  geom_boxplot()

Do the Replicates Look Like Each Other?

We will start to tackle this by plotting the replicates against each other using a scatterplot.

We need to gather the data into long form, and then get the replicates spread into separate columns.

Check Replicates

head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8
ff.s <- ff.m %>% spread(rep, rating)
head(ff.s)
#>   time treatment subject    type   1    2
#> 1    1         1       3 buttery 0.0  0.0
#> 2    1         1       3  grassy 0.0  0.0
#> 3    1         1       3  painty 5.5  0.0
#> 4    1         1       3  potato 2.9 14.0
#> 5    1         1       3  rancid 0.0  1.1
#> 6    1         1      10 buttery 6.4  5.9

Check Replicates

ggplot(data=ff.s, aes(x=`1`, y=`2`)) + geom_point() +
  theme(aspect.ratio=1) + xlab("Rep 1") + ylab("Rep 2")
ggplot(data=ff.s, aes(x=`1`, y=`2`)) + geom_point() +
  theme(aspect.ratio=1) + xlab("Rep 1") + ylab("Rep 2") + 
  scale_x_sqrt() + scale_y_sqrt()

Your Turn

Make the scatterplots of reps against each other separately for scales, and treatment.

Your Turn

Read in the billboard top 100 music data, which contains N'Sync and Backstreet Boys songs that entered the billboard charts in the year 2000

billboard <- read.csv("../data/billboard.csv")

What's in this data? What's X1-X76?

Your Turn

  1. Use tidyr to convert this data into a long format appropriate for plotting a time series (date on the x axis, chart position on the y axis)
  2. Use ggplot2 to create this time series plot:

dplyr Verbs

The package dplyr helps to make various summaries of the data. There are five primary dplyr verbs, representing distinct data analysis tasks:

  • Filter: Remove the rows of a data frame, producing subsets
  • Arrange: Reorder the rows of a data frame
  • Select: Select particular columns of a data frame
  • Mutate: Add new columns that are functions of existing columns
  • Summarise: Create collapsed summaries of a data frame

The Split-Apply-Combine Approach

(Diagram originally from Hadley Wickham)

Split-Apply-Combine in dplyr

french_fries_split <- group_by(ff_long, variable) # SPLIT
french_fries_apply <- summarise(french_fries_split, 
     rating = mean(rating, na.rm = TRUE)) # APPLY + COMBINE
french_fries_apply
#> # A tibble: 5 × 2
#>   variable rating
#>      <chr>  <dbl>
#> 1  buttery   1.82
#> 2   grassy   0.66
#> 3   painty   2.52
#> 4   potato   6.95
#> 5   rancid   3.85

Filter

french_fries %>%
    filter(subject == 3, time == 1)
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
#> 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
#> 3    1         2       3   1   13.9     0.0    0.0    3.9    0.0
#> 4    1         2       3   2   13.4     0.1    0.0    1.5    0.0
#> 5    1         3       3   1   14.1     0.0    0.0    1.1    0.0
#> 6    1         3       3   2    9.5     0.0    0.6    2.8    0.0

Arrange

french_fries %>%
    arrange(desc(rancid)) %>%
    head
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    9         2      51   1    7.3     2.3      0     15    0.1
#> 2   10         1      86   2    0.7     0.0      0     14   13.1
#> 3    5         2      63   1    4.4     0.0      0     14    0.6
#> 4    9         2      63   1    1.8     0.0      0     14   12.3
#> 5    5         2      19   2    5.5     4.7      0     13    4.6
#> 6    4         3      63   1    5.6     0.0      0     13    4.4

Select

french_fries %>%
    select(time, treatment, subject, rep, potato) %>%
    head
#>    time treatment subject rep potato
#> 61    1         1       3   1    2.9
#> 25    1         1       3   2   14.0
#> 62    1         1      10   1   11.0
#> 26    1         1      10   2    9.9
#> 63    1         1      15   1    1.2
#> 27    1         1      15   2    8.8

Summarise

french_fries %>%
    group_by(time, treatment) %>%
    summarise(mean_rancid = mean(rancid), 
              sd_rancid = sd(rancid))
#> Source: local data frame [30 x 4]
#> Groups: time [?]
#> 
#>      time treatment mean_rancid sd_rancid
#>    <fctr>    <fctr>       <dbl>     <dbl>
#> 1       1         1         2.8       3.2
#> 2       1         2         1.7       2.7
#> 3       1         3         2.6       3.2
#> 4       2         1         3.9       4.4
#> 5       2         2         2.1       3.1
#> 6       2         3         2.5       3.4
#> 7       3         1         4.7       3.9
#> 8       3         2         2.9       3.8
#> 9       3         3         3.6       3.6
#> 10      4         1         2.1       2.4
#> # ... with 20 more rows

Your Turn

  • Use summarise to compute the mean science score for each country in the OECD PISA data

String Manipulation

When the experimental design is packed into column names, we need to extract it, and tidy it up.

genes <- read_csv("../data/genes.csv")
kable(head(genes))
id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2 WI-12.R4 WM-12.R1 WM-12.R2 WM-12.R4
Gene 1 2.2 2.20 4.2 2.63 5.1 4.5 5.5 4.4 3.9 4.2 3.73
Gene 2 1.5 0.59 1.9 0.52 2.9 1.4 3.0 1.3 1.2 1.2 0.89
Gene 3 2.0 0.87 3.3 0.53 4.6 2.2 5.6 2.5 2.5 3.0 1.35

Gather Column Names into Long Form

gather(genes, variable, expr, -id) %>% kable
id variable expr
Gene 1 WI-6.R1 2.18
Gene 2 WI-6.R1 1.46
Gene 3 WI-6.R1 2.03
Gene 1 WI-6.R2 2.20
Gene 2 WI-6.R2 0.59
Gene 3 WI-6.R2 0.87
Gene 1 WI-6.R4 4.20
Gene 2 WI-6.R4 1.86
Gene 3 WI-6.R4 3.28
Gene 1 WM-6.R1 2.63
Gene 2 WM-6.R1 0.52
Gene 3 WM-6.R1 0.53
Gene 1 WM-6.R2 5.06
Gene 2 WM-6.R2 2.88
Gene 3 WM-6.R2 4.63
Gene 1 WI-12.R1 4.54
Gene 2 WI-12.R1 1.36
Gene 3 WI-12.R1 2.18
Gene 1 WI-12.R2 5.53
Gene 2 WI-12.R2 2.96
Gene 3 WI-12.R2 5.56
Gene 1 WI-12.R4 4.41
Gene 2 WI-12.R4 1.31
Gene 3 WI-12.R4 2.54
Gene 1 WM-12.R1 3.85
Gene 2 WM-12.R1 1.18
Gene 3 WM-12.R1 2.54
Gene 1 WM-12.R2 4.18
Gene 2 WM-12.R2 1.19
Gene 3 WM-12.R2 3.04
Gene 1 WM-12.R4 3.73
Gene 2 WM-12.R4 0.89
Gene 3 WM-12.R4 1.35

Separate Columns

genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  kable
id trt leftover expr
Gene 1 WI 6.R1 2.18
Gene 2 WI 6.R1 1.46
Gene 3 WI 6.R1 2.03
Gene 1 WI 6.R2 2.20
Gene 2 WI 6.R2 0.59
Gene 3 WI 6.R2 0.87
Gene 1 WI 6.R4 4.20
Gene 2 WI 6.R4 1.86
Gene 3 WI 6.R4 3.28
Gene 1 WM 6.R1 2.63
Gene 2 WM 6.R1 0.52
Gene 3 WM 6.R1 0.53
Gene 1 WM 6.R2 5.06
Gene 2 WM 6.R2 2.88
Gene 3 WM 6.R2 4.63
Gene 1 WI 12.R1 4.54
Gene 2 WI 12.R1 1.36
Gene 3 WI 12.R1 2.18
Gene 1 WI 12.R2 5.53
Gene 2 WI 12.R2 2.96
Gene 3 WI 12.R2 5.56
Gene 1 WI 12.R4 4.41
Gene 2 WI 12.R4 1.31
Gene 3 WI 12.R4 2.54
Gene 1 WM 12.R1 3.85
Gene 2 WM 12.R1 1.18
Gene 3 WM 12.R1 2.54
Gene 1 WM 12.R2 4.18
Gene 2 WM 12.R2 1.19
Gene 3 WM 12.R2 3.04
Gene 1 WM 12.R4 3.73
Gene 2 WM 12.R4 0.89
Gene 3 WM 12.R4 1.35

genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") %>% kable
id trt time rep expr
Gene 1 WI 6 R1 2.18
Gene 2 WI 6 R1 1.46
Gene 3 WI 6 R1 2.03
Gene 1 WI 6 R2 2.20
Gene 2 WI 6 R2 0.59
Gene 3 WI 6 R2 0.87
Gene 1 WI 6 R4 4.20
Gene 2 WI 6 R4 1.86
Gene 3 WI 6 R4 3.28
Gene 1 WM 6 R1 2.63
Gene 2 WM 6 R1 0.52
Gene 3 WM 6 R1 0.53
Gene 1 WM 6 R2 5.06
Gene 2 WM 6 R2 2.88
Gene 3 WM 6 R2 4.63
Gene 1 WI 12 R1 4.54
Gene 2 WI 12 R1 1.36
Gene 3 WI 12 R1 2.18
Gene 1 WI 12 R2 5.53
Gene 2 WI 12 R2 2.96
Gene 3 WI 12 R2 5.56
Gene 1 WI 12 R4 4.41
Gene 2 WI 12 R4 1.31
Gene 3 WI 12 R4 2.54
Gene 1 WM 12 R1 3.85
Gene 2 WM 12 R1 1.18
Gene 3 WM 12 R1 2.54
Gene 1 WM 12 R2 4.18
Gene 2 WM 12 R2 1.19
Gene 3 WM 12 R2 3.04
Gene 1 WM 12 R4 3.73
Gene 2 WM 12 R4 0.89
Gene 3 WM 12 R4 1.35

gtidy <- genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") %>%
  mutate(trt = sub("W", "", trt)) %>%
  mutate(rep = sub("R", "", rep))
kable(head(gtidy))
id trt time rep expr
Gene 1 I 6 1 2.18
Gene 2 I 6 1 1.46
Gene 3 I 6 1 2.03
Gene 1 I 6 2 2.20
Gene 2 I 6 2 0.59
Gene 3 I 6 2 0.87

Your Turn

  1. Using the tidied dataset (gtidy), find the mean expression for each combination of id, trt, and time.
  2. Use this tidied data to make this plot.

Re-structuring the Temperature Data

melbtemp.m <- melbtemp %>%
  select(num_range("V", c(1,2,3,4,seq(5,128,4)))) %>%
  filter(V4 %in% c("PRCP", "TMAX", "TMIN")) %>%
  gather(day, value, V5:V125, na.rm = TRUE) %>%
  spread(V4, value) %>%
  mutate(
    tmin = as.numeric(TMIN) / 10,
    tmax = as.numeric(TMAX) / 10,
    t_range = tmax - tmin,
    prcp = as.numeric(PRCP) / 10
  ) %>%
  rename(stn=V1, year=V2, month=V3)

kable(head(melbtemp.m))
stn year month day PRCP TMAX TMIN tmin tmax t_range prcp
ASN00086282 1970 7 V101 0 158 78 7.8 16 8.0 0.0
ASN00086282 1970 7 V105 13 149 36 3.6 15 11.3 1.3
ASN00086282 1970 7 V109 3 133 61 6.1 13 7.2 0.3
ASN00086282 1970 7 V113 0 143 46 4.6 14 9.7 0.0
ASN00086282 1970 7 V117 25 150 42 4.2 15 10.8 2.5
ASN00086282 1970 7 V121 0 145 63 6.3 14 8.2 0.0

melbtemp.m$day <- factor(melbtemp.m$day, 
  levels=c("V5","V9","V13","V17","V21","V25","V29",
           "V33","V37","V41","V45","V49","V53","V57",
           "V61","V65","V69","V73","V77","V81","V85",
           "V89","V93","V97","V101","V105","V109",
           "V113","V117","V121","V125"),
  labels=1:31)
melbtemp.m$date <- as.Date(paste(melbtemp.m$day, 
     melbtemp.m$month, melbtemp.m$year, sep="-"),
     "%d-%m-%Y")

kable(head(melbtemp.m))
stn year month day PRCP TMAX TMIN tmin tmax t_range prcp date
ASN00086282 1970 7 25 0 158 78 7.8 16 8.0 0.0 1970-07-25
ASN00086282 1970 7 26 13 149 36 3.6 15 11.3 1.3 1970-07-26
ASN00086282 1970 7 27 3 133 61 6.1 13 7.2 0.3 1970-07-27
ASN00086282 1970 7 28 0 143 46 4.6 14 9.7 0.0 1970-07-28
ASN00086282 1970 7 29 25 150 42 4.2 15 10.8 2.5 1970-07-29
ASN00086282 1970 7 30 0 145 63 6.3 14 8.2 0.0 1970-07-30

Re-structuring Tuberculosis Data

tb_tidy <- tb %>%
  gather(demographic, cases, m_04:f_u, na.rm = TRUE) %>%
  separate(demographic, c("sex", "age"), "_") %>%
  rename(country = iso2) %>%
  arrange(country, year, sex, age) 
kable(head(tb_tidy))
country year sex age cases
AD 1996 f 014 0
AD 1996 f 1524 1
AD 1996 f 2534 1
AD 1996 f 3544 0
AD 1996 f 4554 0
AD 1996 f 5564 1

Dates and Times

Dates are deceptively hard to work with.

Example: 02/05/2012. Is it February 5th, or May 2nd?

Other things are difficult too:

  • Time zones
  • POSIXct format in base R is challenging

The lubridate, and timeDate package helps tackle some of these issues.

Basic Lubridate Use

now()
#> [1] "2016-12-01 06:46:07 AEDT"
today()
#> [1] "2016-12-01"
now() + hours(4)
#> [1] "2016-12-01 10:46:07 AEDT"
today() - days(2)
#> [1] "2016-11-29"

Parsing Dates

ymd("2013-05-14")
#> [1] "2013-05-14"
mdy("05/14/2013")
#> [1] "2013-05-14"
dmy("14052013")
#> [1] "2013-05-14"
ymd_hms("2013:05:14 14:5:30", tz = "Australia/Melbourne")
#> [1] "2013-05-14 14:05:30 AEST"

Extracting Temporal Elements

month(ymd("2013-05-14"))
#> [1] 5
year(ymd("2013-05-14"))
#> [1] 2013
wday(ymd("2013-05-14"), label=TRUE, abbr=TRUE)
#> [1] Tues
#> Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
isWeekday(ymd("2013-05-14"))
#> 2013-05-14 
#>       TRUE

Your Turn

For the pedestrian sensor data, extract

  • year
  • month
  • day of the week

from the date variable

Cognitive Principles

  • Hierarchy of mappings: (first) position along an axis - (last) color (Cleveland, 1984; Heer and Bostock, 2009)
  • Pre-attentive: Some elements like color are noticed before you even realise it. Other elements like axes are to look up information later.
  • Color palettes: qualitative, sequential, diverging. The type of variable determines the appropriate palette.
  • Color blindness: you can proof your plots with te dichromat package.
  • Proximity: To compare elements, place them close together.
  • Change blindness: When focus is interrupted differences may not be noticed, can occur when you are reading across multiple plots.

Hierarchy of Mappings

  1. Position - common scale (BEST)
  2. Position - nonaligned scale
  3. Length, direction, angle
  4. Area
  5. Volume, curvature
  6. Shading, color (WORST)

Pre-attentive

Can you find the odd one out?

Is it easier now?

Color Palettes

  • Qualitative: categorical variables
  • Sequential: low to high numeric values
  • Diverging: negative to positive values

Scales

internet <- read_csv("../data/internet.csv")
ggplot(data=internet, aes(x=`Social networks`, fill=Gender)) + 
  geom_bar(position="dodge") +
  scale_fill_manual(values=c("Female"="orange", "Male"="darkgreen")) + 
  facet_wrap(~name, ncol=5) + 
  theme(legend.position="bottom")

Scales

ggplot(data=grad, aes(x=subject, y=AvGREs)) + 
  geom_boxplot() + scale_y_log10()

Axes

The date time axis is a little trickier to re-organise, but it can be done.

ggplot(data=rates.sub.m, aes(x=date, y=rate, 
  colour=currency)) + geom_line() + 
  scale_x_datetime(breaks = date_breaks("1 month"), 
                   labels = date_format("%b"))

ggplot(data=rates.sub.m, aes(x=date, y=rate, 
        colour=currency)) + geom_line() +
  xlab("Date") + ylab("Standardized rates") + 
  ggtitle("Cross rates 23/2/2015-11/11/2015")

Equations in Labels

ggplot(data=rates.sub.m, aes(x=date, y=rate, colour=currency)) +
  geom_line() +
  xlab(expression(Date[i]^2~ mu ~ pi * sigma)) + 
  ylab("Standardized rates") + 
  ggtitle("Cross rates 23/2/2015-11/11/2015")

Save the plot to an R object, p, for re-use

p <- ggplot(data=rates.sub.m, aes(x=date, y=rate, 
  colour=currency)) + geom_line() + 
  scale_x_datetime(breaks = date_breaks("1 month"), 
                   labels = date_format("%b"))

Legend Position

p + theme(legend.position = "bottom")

Themes

p + theme_tufte()

p + theme_economist()

Color Palettes

p + scale_color_brewer("", palette = "Dark2")

Color blind-proofing

clrs <- hue_pal()(3)
p + scale_color_manual("", values=clrs) + 
  theme(legend.position = "none")

clrs <- dichromat(hue_pal()(3))
p + scale_color_manual("", values=clrs) + 
  theme(legend.position = "none")

clrs <- brewer.pal(3, "Dark2")
p + scale_color_manual("", values=clrs) + 
  theme(legend.position = "none")

clrs <- dichromat(brewer.pal(3, "Dark2"))
p + scale_color_manual("", values=clrs) + 
  theme(legend.position = "none")

Your Turn

Proximity - From the plot can you answer: Is the proportion of girls who use social networks every day (4) higher than boys, in Australia? And is this different in Germany?

Your Turn

  • Brainstorm with your neighbour ways to rearrange this plot to answer the question.

  • Then tackle this question: Are German girls more likely to report using social networks once or twice per month (1) than Japanese girls?

  • What ways would you re-arrange the plot to tackle this one?

Proximity

  • It is ok to make more than one plot.
  • Actually it is recommended.

Next

Advanced graphics

Credits